
dflash: split target/draft StepGraphs to fix ggml_gallocr realloc per spec-decode step (issue #55) #62

Open

dusterbloom wants to merge 3 commits into Luce-Org:main from dusterbloom:fix/issue-55-stable-kv-pad

Conversation

@dusterbloom
Contributor

Fixes #55.

Root cause

Every spec-decode iteration calls build_target_step_tree (target verify, ~3127 ggml graph nodes) at dflash/test/test_dflash.cpp:1703 and build_draft_step (draft forward, ~186 nodes) at dflash/test/test_dflash.cpp:1556 on the same StepGraph sg, sharing one ggml_gallocr. ggml_gallocr_needs_realloc compares galloc->n_nodes to graph->n_nodes, so every call sees a mismatch left over from the previous call's opposite topology — forcing ggml_gallocr_reserve to re-walk the entire graph (CPU cost) and often cudaFree+cudaMalloc the activation buffer (GPU driver cost).
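In ggml_gallocr terms, the per-step pattern is roughly the following (a sketch only: the builders take more arguments than shown, the galloc member name is illustrative, and gf_target / gf_draft stand for the graphs produced by build_target_step_tree and build_draft_step):

  // Pre-fix layout: one allocator shared by both builders.
  ggml_gallocr_t galloc = sg.galloc;

  // Every spec-decode step alternates two very different topologies on it:
  ggml_gallocr_alloc_graph(galloc, gf_target); // target verify, ~3127 nodes: 186 -> 3127, realloc
  ggml_gallocr_alloc_graph(galloc, gf_draft);  // draft forward,  ~186 nodes: 3127 -> 186, realloc again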

The reporter, on Windows / RTX 4090 / CUDA 13, sees "ggml_gallocr_needs_realloc: graph has different number of nodes" log spam every step (the message is gated behind #ifndef NDEBUG, so Linux Release builds are silent but pay the same cost). Decode throughput halves from 90 tok/s at 16k context to 55 tok/s at 32k context.

Fix

Split the shared StepGraph sg into target_sg and draft_sg, each with its own ggml_gallocr. Target verify settles into the 3127-node topology, the draft into the 186-node one, and neither bounces.

 dflash/test/test_dflash.cpp | 42 ++++++++++++++++++++++++++++--------------

The diff is minimized via a StepGraph & sg = target_sg; alias so the existing prefill/target-verify call sites are unchanged; only the draft block (~10 references) swaps sg.X for draft_sg.X. Daemon-mode reset and the two migrate-cache sites destroy both StepGraphs.
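In outline, the split looks like this (again a sketch: the real StepGraph in test_dflash.cpp carries more state, and galloc here just stands for the StepGraph's own allocator):

  StepGraph target_sg;           // its gallocr settles into the ~3127-node verify topology
  StepGraph draft_sg;            // its gallocr settles into the ~186-node draft topology
  StepGraph & sg = target_sg;    // existing prefill / target-verify call sites keep using sg

  // Per spec-decode step, each allocator now only ever sees one topology:
  ggml_gallocr_alloc_graph(sg.galloc,       gf_target); // always ~3127 nodes
  ggml_gallocr_alloc_graph(draft_sg.galloc, gf_draft);  // always  ~186 nodes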

Verification

Patched ggml_gallocr_alloc_graph to unconditionally fprintf to stderr at each "needs_realloc returns true" site (removing the #ifndef NDEBUG gate). Ran test_dflash on a tokenized HE prompt + n_gen=256 + --ddtree-budget=22 + --max-ctx=2048 + --fast-rollback. Same prompt, same flags, before vs after this commit:
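The instrumentation was roughly the following, applied at the node-count check and at the other return-true sites in the same way (a sketch against the upstream ggml-alloc.c layout; exact code differs between ggml versions):

  static bool ggml_gallocr_needs_realloc(ggml_gallocr_t galloc, struct ggml_cgraph * graph) {
      if (galloc->n_nodes != graph->n_nodes) {
          // upstream gates this behind #ifndef NDEBUG and uses the debug logger;
          // for this measurement it is an unconditional fprintf instead
          fprintf(stderr, "%s: graph has different number of nodes (%d -> %d)\n",
                  __func__, galloc->n_nodes, graph->n_nodes);
          return true;
      }
      // ... per-node offset/size checks, instrumented the same way ...
      return false;
  }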

                                             Before   After
  needs_realloc events over 26 steps           56     3 (initial only)
  cudaFree+cudaMalloc events during decode     14     0

Reasons breakdown before fix:

  • 26: n_nodes 186 → 3127
  • 25: n_nodes 3127 → 186 (the alternation)
  • 3: per-tensor size grew (monotonic kv_pad growth)
  • 2: initial reserves

After fix: just the 3 initial reserves (one per gallocr, each fired exactly once at first use).

Bench (RTX 3090 / Linux / CUDA 12.6)

bench_he.py --n-gen 128 --ddtree-budget 22, 3-run mean:

  • main: 86.72 tok/s (runs: 85.38, 88.05, 86.72)
  • this fix: 84.99 tok/s (runs: 84.98, 81.71, 88.30)

Within bench-noise. This is consistent with the hypothesis that on Linux/CUDA 12.6 the per-step cudaFree+cudaMalloc cost is small (driver fast-paths the alloc), so eliminating it doesn't show up as decode tok/s. The reporter's Windows/CUDA 13 stack has a slower stream-allocator where the saved cost should translate into measurable tok/s recovery — needs verification on their box.

Test plan

  • Build clean (Release).
  • bench_he.py parity (within noise).
  • Bit-exact correctness preserved across the fix (same step count, same accept length, same final committed token).
  • Instrumentation confirms the realloc-per-step pattern is gone.
  • Reporter (@gtrak) verifies on Windows / RTX 4090 / CUDA 13 — does this recover decode tok/s at 16k+ context?

What this does NOT fix

  • Residual needs_realloc events from monotonic kv_pad growth at long context — these are rare boundary crossings (every ~256 tokens), not per-step; see the sketch after this list. Codex review flagged these as low-severity; not chasing unless the reporter still sees churn.
  • Codex review also flagged the StepGraph & sg = target_sg; alias as a low-severity readability footgun — a follow-up s/sg/target_sg/g would clarify. Held to keep this PR minimal.
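Sketch for the residual kv_pad point above (the 256-token granularity is inferred from the observed cadence, not read from the code; the real padding rule lives in test_dflash.cpp):

  // If the padded KV length only grows in 256-token steps, activation tensor sizes
  // change only when a boundary is crossed, so needs_realloc fires roughly once per
  // ~256 generated tokens at long context instead of once per spec-decode step.
  int64_t kv_pad = ((n_kv + 255) / 256) * 256;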

dflash: split target/draft StepGraphs to fix ggml_gallocr realloc per spec-decode step

Issue Luce-Org#55: every spec-decode iteration calls build_target_step_tree
(target verify, ~3127 graph nodes) and build_draft_step (draft forward,
~186 graph nodes) on the SAME StepGraph, sharing one ggml_gallocr.
ggml_gallocr_needs_realloc compares galloc->n_nodes to graph->n_nodes,
so every call sees a mismatch left over from the previous call's
opposite topology, forcing ggml_gallocr_reserve to re-walk the entire
graph (CPU cost) and often cudaFree+cudaMalloc the activation buffer
(GPU driver cost).  Reporter on Windows/RTX 4090 sees the
"graph has different number of nodes" debug log fire every step and
decode tok/s halving from 90 @ 16k context to 55 @ 32k context.

Fix: introduce target_sg and draft_sg, each with its own ggml_gallocr.
Target verify settles into the 3127-node graph topology, draft into
the 186-node topology, and neither bounces.  Existing prefill /
target-verify call sites keep their `sg` references via a
StepGraph & sg = target_sg alias; only the draft block (~10 calls)
swaps `sg.X` for `draft_sg.X`.  Daemon-mode reset and migrate-cache
sites destroy both StepGraphs.

Verified with one-line instrumentation patch on ggml_gallocr_alloc_graph
(unconditionally fprintf to stderr at each "needs_realloc returns
true" site, removing the #ifndef NDEBUG gate the upstream messages
are silenced by in Release builds).  HE prompt 00 + ddtree-budget=22 +
n_gen=256 over 26 spec-decode steps:

  Before: 56 needs_realloc events (alternating n_nodes 186 ↔ 3127),
          14 cudaFree+cudaMalloc events.
  After:  3 needs_realloc events (initial only: 0 -> 3127, 0 -> 3079,
          0 -> 186), 0 cudaFree+cudaMalloc events during decode.

bench_he.py (RTX 3090, --n-gen 128, --ddtree-budget 22, 3-run mean):
  main:     86.72 tok/s
  this fix: 84.99 tok/s
Within bench-noise on Linux/CUDA 12.6 because cudaMalloc is cheap on
this stack — the saved per-step cost is small.  The reporter's stack
(Windows/CUDA 13/RTX 4090) has a slower stream-allocator where the
saved cost should translate into measurable tok/s recovery; that
needs verification on the reporter's box.
@dusterbloom dusterbloom marked this pull request as draft April 30, 2026 12:56
@dusterbloom dusterbloom marked this pull request as ready for review April 30, 2026 13:15

@cubic-dev-ai (bot) left a comment


No issues found across 2 files

@dusterbloom dusterbloom force-pushed the fix/issue-55-stable-kv-pad branch from 05cb709 to 0ce6832 Compare May 1, 2026 16:29
@davide221
Contributor

@gtrak can you verify if this is solved?

javierpazo added a commit to javierpazo/lucebox-hub that referenced this pull request May 10, 2026
This change brings concurrent multi-request execution to test_dflash
on a single GPU. It is internally one cohesive unit but can be split
into four conceptual pieces if a smaller review is preferred:

1. Multi TargetCache slots
   - CLI: --target-cache-slots=N (alias --cache-slots=N)
   - prefix `SLOT <id>` routes commands to a specific slot
   - DaemonSlotState + RAII ActiveDaemonSlot for safe switching
   - LIST_TARGET_CACHE_SLOTS for introspection
   - all slots share target/draft weights; only KV/SSM/scratch is
     per-slot
   - create_target_cache gains an `n_seqs` parameter so a single
     cache can be allocated batched up front

2. Tagged stream protocol (opt-in)
   - --stream-tagged emits frames `[-2, request_id, token]` instead
     of bare int32 tokens; sentinels `-4` (CONTINUE), `-1` (DONE)
   - parser recognises `REQ <id>` / `REQUEST <id>` headers
   - legacy bare-int32 streaming is unchanged when the flag is off
   - this lets a client demux multiple concurrent requests over the
     same stdout

3. Native quantum scheduler
   - dispatch table for REQ/SLOT/START, SCHED_STEP, SCHED_DRAIN,
     LIST_REQUESTS
   - cursor-based fair round-robin between admitted requests
   - non-blocking reader thread admits new requests during a drain
   - PendingQuantum{slot, req, epoch, n_gen} carries the unit of work
   - CONTINUE / CONT resumes a slot without re-prefilling
   - REQ <id> CANCEL invalidates a request and bumps the slot epoch
     so a stale CONTINUE is rejected; RESTORE_CHAIN / legacy generate
     refuse to overwrite a slot that is owned by an active scheduler
     request

4. Fused batched target step (CUDA path)
   - new commands: SCHED_BATCH_PEEK, SCHED_BATCH_PROBE,
     SCHED_BATCH_TARGET_TAIL, SCHED_BATCH_TARGET_STEP,
     SCHED_BATCH_DRAIN
   - QwenGraphInputs gains `n_seqs`; build_delta_net_block accepts
     n_seqs > 1
   - target_feat is allocated as [5*hidden, target_feat_cap, n_seqs]
     when batched and the chain forwards capture features per-seq
   - batch_probe_compare_ok smoke shows mismatches=0 vs the
     single-seq path; SCHED_BATCH_TARGET_TAIL commits two completed
     pending quanta in 29.26 ms; SCHED_BATCH_TARGET_STEP commits the
     next batched step in 29.57 ms; SCHED_BATCH_DRAIN completes
     req12/req13 with two batched steps each
   - rollback for partially accepted draft tokens, multi-token verify
     and parent-id propagation in the batched path are noted as
     follow-ups; today the batched step accepts the cleanest case
     and falls back to single-seq when needed

Validation (single GPU, GPU1: RTX 6000 Ada sm_89, Heretic Q4_K_M target +
Q8 GGUF or FP16 safetensors drafter, FA_WINDOW=0, KV q4_0/q4_0):

- Two concurrent requests:
    REQ 4 START SLOT 0 quantum=2
    REQ 5 START SLOT 1 quantum=2
    SCHED_DRAIN closes both clean.
    slot 0: 18.41 tok/s, slot 1: 22.50 tok/s
- Mid-drain admission of REQ 6 succeeds; CONTINUE on slot 0 resumes
  without re-prefill.
- batch_probe_compare_ok mismatches=0 over a 2-seq probe.
- batch_tail_commit count=2 ms=29.26.
- batch_step_commit ms=29.57 followed by SCHED_DRAIN reverts cleanly
  back to the DFlash single-seq path.

Compatibility:
- All new behaviour is opt-in. Default invocation of test_dflash
  with no scheduler flags keeps the legacy single-request path.
- Tagged stream is gated behind --stream-tagged.
- Multi-slot is gated behind --target-cache-slots=N (default N=1).
- Batched target step is reached only via the SCHED_BATCH_* command
  family; legacy SCHED_STEP keeps using the single-seq path.
- Hot-loop diagnostic logs (sync_us / step_debug) are now gated
  behind DFLASH27B_TIMING_DEBUG / DFLASH27B_STEP_DEBUG so the
  default path is unchanged.

Verification vs existing community PRs:
- No prior art in lucebox-hub for the SCHED_BATCH_* protocol or for
  a native C++ quantum scheduler with REQ/SLOT/CONTINUE/CANCEL +
  epoch hardening. Checked against PR Luce-Org#39 (CUDA graph reuse) and
  PR Luce-Org#62 (split target/draft StepGraphs); both reuse / split graphs
  but neither exposes a multi-request slot protocol.
- No upstream collision found for tagged stream framing or
  --target-cache-slots.

Happy to split this into four sequential PRs (slots / tagged stream /
quantum scheduler / batched target step) if a smaller-grained review
is preferred — let me know.

Author: Javier Pazo <xabicasa@gmail.com>